Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio -- Episode 1: Machine Transcription of the Manuscripts
In Codice Ratio is a research project studying tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system that supports the transcription of medieval manuscripts. The goal is to provide paleographers with a tool that reduces the effort of transcribing large volumes, such as those stored in the VSA, by producing good transcriptions for significant portions of the manuscripts. We propose an original approach based on character segmentation. Our solution is able to deal with the dirty segmentation that inevitably occurs in handwritten documents. We use a convolutional neural network to recognize characters and language models to compose word transcriptions. Our approach requires minimal training effort, making the transcription process more scalable: producing a training set requires only a few pages and can easily be crowdsourced. We have conducted experiments on manuscripts from the Vatican Registers, an unreleased corpus containing the correspondence of the popes. With training data produced by 120 high school students, our system produced good transcriptions that paleographers can use as a solid basis to speed up the transcription process at large scale.
Comment: Donatella Firmani, Marco Maiorino, Paolo Merialdo, and Elena Nieddu. 2018. Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio - Episode 1: Machine Transcription of the Manuscripts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 263-27
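The combination of per-character recognition scores with a word-level language model can be illustrated with a minimal sketch. This is not the project's code: the CNN outputs, the tiny lexicon, and the candidate words below are all hypothetical, standing in for real character classifiers and language-model priors.

```python
import math

def best_word(char_probs, lexicon):
    """Pick the lexicon word maximizing log(LM prior) + sum of log CNN scores.

    char_probs: one dict per character segment, mapping candidate
    characters to (hypothetical) CNN confidences.
    lexicon: word -> language-model prior probability.
    """
    best, best_score = None, float("-inf")
    for word, prior in lexicon.items():
        if len(word) != len(char_probs):
            continue  # only consider words matching the segmentation length
        score = math.log(prior)
        for ch, probs in zip(word, char_probs):
            # unseen characters get a tiny floor probability
            score += math.log(probs.get(ch, 1e-9))
        if score > best_score:
            best, best_score = word, score
    return best

# Hypothetical CNN outputs for three character segments of a Latin word:
char_probs = [
    {"d": 0.6, "c": 0.4},
    {"e": 0.9, "c": 0.1},
    {"i": 0.5, "l": 0.5},
]
lexicon = {"dei": 0.7, "del": 0.2, "cei": 0.1}
print(best_word(char_probs, lexicon))  # → dei
```

The language model lets the system recover from ambiguous character scores (here, "i" vs. "l" is a tie broken by the word prior), which is what makes noisy segmentation tolerable.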
Hierarchical Entity Resolution using an Oracle
In many applications, entity references (i.e., records) and entities need to be organized to capture diverse relationships such as type-subtype, is-A (mapping entities to types), and duplicate (mapping records to entities) relationships. However, automatic identification of such relationships is often inaccurate due to noise and the heterogeneous representation of records across sources. Similarly, manual maintenance of these relationships is infeasible and does not scale to large datasets. In this work, we circumvent these challenges by considering weak supervision in the form of an oracle to formulate a novel hierarchical ER task. In this setting, records are clustered in a tree-like structure that contains records at the leaf level and captures record-entity (duplicate), entity-type (is-A), and subtype-supertype relationships. For effective use of supervision, we leverage triplet comparison oracle queries that take three records as input and output the most similar pair(s). We develop HierER, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identified hierarchical structure. We show theoretically and empirically that HierER is effective under different similarity noise models, and demonstrate empirically that HierER can scale to million-record datasets.
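A triplet comparison oracle can be sketched in a few lines. This toy version is an assumption for illustration only (in the paper the oracle models a human or weak supervisor, not a similarity function): it answers with the most similar pair(s) among three records, here using token Jaccard similarity.

```python
from itertools import combinations

def jaccard(a, b):
    """Token Jaccard similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def triplet_oracle(r1, r2, r3):
    """Given three records, return the most similar pair(s) (ties included)."""
    pairs = list(combinations((r1, r2, r3), 2))
    scores = [jaccard(a, b) for a, b in pairs]
    best = max(scores)
    return [p for p, s in zip(pairs, scores) if s == best]

print(triplet_oracle("iphone 13 pro", "iphone 13", "galaxy s21"))
# → [('iphone 13 pro', 'iphone 13')]
```

A querying strategy like HierER would issue such queries selectively, ordered by precomputed pair similarities, so that each answer prunes as much of the candidate hierarchy as possible.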
CERTEM: Explaining and Debugging Black-box Entity Resolution Systems with CERTA
Entity resolution (ER) aims at identifying record pairs that refer to the same real-world entity. Recent works have focused on deep learning (DL) techniques to solve this problem. While such works have brought tremendous enhancements in terms of effectiveness in solving the ER problem, understanding their matching predictions is still a challenge because of the intrinsic opaqueness of DL-based solutions. Interpreting and trusting the predictions made by ER systems is crucial for humans in order to employ such methods in decision-making pipelines. We demonstrate CERTEM, an explanation system for ER based on CERTA, a recently introduced explainability framework for ER, that is able to provide both saliency explanations, which associate each attribute with a saliency score, and counterfactual explanations, which provide examples of values that can flip a prediction. In this demonstration we showcase how CERTEM can be effectively employed to better understand and debug the behavior of state-of-the-art DL-based ER systems on data from publicly available ER benchmarks.
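The idea of attribute-level saliency for a black-box matcher can be sketched with a simple masking probe. This is a hedged illustration, not the CERTA algorithm: the stand-in matcher and the records are invented, and the saliency score is just the drop in match score when one attribute is hidden.

```python
def toy_matcher(left, right):
    """Stand-in black box: fraction of shared attributes with equal values."""
    keys = left.keys() & right.keys()
    return sum(left[k] == right[k] for k in keys) / len(keys)

def saliency(left, right, matcher):
    """Score each left-record attribute by how much masking it lowers the match score."""
    base = matcher(left, right)
    scores = {}
    for k in left:
        masked = dict(left, **{k: None})  # hide attribute k
        scores[k] = base - matcher(masked, right)
    return scores

l = {"title": "ipad mini", "brand": "apple", "price": "299"}
r = {"title": "ipad mini", "brand": "apple", "price": "310"}
print(saliency(l, r, toy_matcher))
# title and brand get positive saliency; price (already unequal) gets 0.0
```

A counterfactual explanation works in the opposite direction: instead of masking values, it searches for replacement values (e.g., a different price or title) that flip the matcher's decision.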
Bridging the Gap between Buyers and Sellers in Data Marketplaces with Personalized Datasets
Sharing, discovering, and integrating data is a crucial task that poses many challenges and open research directions. Data owners need to know what data consumers want, and data consumers need to find datasets that are satisfactory for their tasks. Several data market platforms, or data marketplaces (DMs), have been used so far to facilitate data transactions between data owners and customers. However, current DMs are mostly shop windows: customers have to rely on metadata that owners manually curate to discover useful datasets, and there is no automated mechanism for owners to determine whether their data could be merged with other datasets to satisfy customers' desiderata. The availability of novel artificial intelligence techniques for data management has sparked renewed interest in proposing new DMs that stray from this conventional paradigm and overcome its limitations. This paper envisions a conceptual framework called DataStreet in which DMs can create personalized datasets by combining available datasets and presenting summarized statistics to help users make informed decisions. In our framework, owners share some of their data with a trusted DM, and customers provide a dataset template to fuel content-based (rather than metadata-based) search queries. Upon each query, the DM creates a preview of the personalized dataset through a flexible use of dataset discovery, integration, and value measurement, while ensuring owners' fair treatment and preserving privacy. The previewed datasets need not be pre-defined in the DM and are materialized only upon a successful transaction.
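Content-based search driven by a dataset template can be sketched as ranking candidate datasets by value overlap rather than metadata. The paper describes DataStreet only at the conceptual level, so everything below (the scoring function, the column-matching heuristic, the example datasets) is a hypothetical illustration.

```python
def column_overlap(template_vals, candidate_vals):
    """Fraction of the template column's sample values found in a candidate column."""
    t, c = set(template_vals), set(candidate_vals)
    return len(t & c) / len(t) if t else 0.0

def rank_datasets(template, candidates):
    """Rank candidate datasets by mean best-column coverage of the template.

    template / candidates[name]: dict mapping column name -> sample values.
    """
    scores = {}
    for name, ds in candidates.items():
        # Match each template column to its best-overlapping candidate column.
        scores[name] = sum(
            max((column_overlap(tv, cv) for cv in ds.values()), default=0.0)
            for tv in template.values()
        ) / len(template)
    return sorted(scores.items(), key=lambda kv: -kv[1])

template = {"city": ["rome", "milan"], "pop": ["2.8M", "1.4M"]}
candidates = {
    "cities_eu": {"name": ["rome", "milan", "paris"],
                  "population": ["2.8M", "1.4M", "2.1M"]},
    "weather": {"station": ["jfk", "lax"], "temp": ["12", "18"]},
}
print(rank_datasets(template, candidates))  # 'cities_eu' ranks first
```

A real DM would replace this exact-value overlap with dataset-discovery techniques (containment estimation, joinability search) and fold in value measurement and privacy constraints before materializing the preview.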
Experiences and Lessons Learned from the SIGMOD Entity Resolution Programming Contests
We report our experience in running three editions (2020, 2021, 2022) of the SIGMOD programming contest, a well-known event in which students engage in solving exciting data management problems. During this period we had the opportunity to introduce participants to the entity resolution task, which is of paramount importance in the data integration community. We aim to share the executive decisions made by the people co-authoring this report, and the lessons learned.
Preface of the 31st Italian Symposium on Advanced Database Systems
This volume contains the proceedings of the 31st Italian Symposium on Advanced Database Systems (SEBD - Sistemi Evoluti per Basi di Dati), held in Galzignano Terme (Padua, Italy) from 2 to 5 July 2023.
On unifying the space of l0-sampling algorithms
The problem of building an l0-sampler is to sample near-uniformly from the support set of a dynamic multiset. This problem has a variety of applications within data analysis, computational geometry, and graph algorithms. In this paper, we abstract a set of steps for building an l0-sampler, based on sampling, recovery, and selection. We analyze the implementation of an l0-sampler within this framework, and show how prior constructions of l0-samplers can all be expressed in terms of these steps. Our experimental contribution is to provide a first detailed study of the accuracy and computational cost of l0-samplers.
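The sampling/recovery/selection recipe can be sketched on an in-memory multiset. This is an illustration of the three steps under simplifying assumptions, not a streaming implementation: real l0-samplers replace the explicit survivor scan with sparse-recovery sketches that work under insertions and deletions.

```python
import hashlib

def h(item, seed=0):
    """Deterministic hash of an item to [0, 1)."""
    d = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def l0_sample(counts, recovery_bound=4, seed=0):
    """counts: item -> net count (updates may cancel). Return one support item."""
    support = [x for x, c in counts.items() if c != 0]
    if not support:
        return None
    # Sampling: subsample at geometrically decreasing levels 1, 1/2, 1/4, ...
    # until few enough items survive to be recovered exactly.
    level = 1.0
    while sum(h(x, seed) < level for x in support) > recovery_bound:
        level /= 2
    # Recovery: identify the survivors at the chosen level (a streaming
    # sampler would extract them from an s-sparse recovery sketch).
    survivors = [x for x in support if h(x, seed) < level]
    if not survivors:  # rare failure; real samplers retry with fresh randomness
        return l0_sample(counts, recovery_bound, seed + 1)
    # Selection: output the survivor with the minimum hash value.
    return min(survivors, key=lambda x: h(x, seed))

counts = {"a": 3, "b": -2, "c": 0, "d": 1}  # "c" canceled out by deletions
print(l0_sample(counts))  # one of 'a', 'b', 'd'; never 'c'
```

Because the output depends only on hash values, not on counts, item "a" (count 3) is no more likely than "d" (count 1), which is the near-uniform-over-support guarantee that distinguishes l0-sampling from frequency-weighted sampling.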